Basic pattern syntax
The simplest use of a regular expression is to match a literal string.
| This regular expression | matches the text shown in red |
| simple | a simple file name.txt |
| file | a simple file name.txt |
The following are special characters:
\ [ ] { } ( ) $ ^ * . ? : - +
Special characters are not compared literally with characters in the text for matching purposes, but control matching in other ways. For example, a period (.) matches any character:
| This regular expression | matches the text shown in red |
| .m | a simple text file name.txt |
| .m. | a simple text file name.txt |
Because they have special meanings, special characters must be escaped using a backslash (\) to match them:
| This regular expression | matches the text shown in red |
| (simple) | a (simple) file name.txt |
| \(simple\) | a (simple) file name.txt |
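The effect of escaping can be tried out in any regular expression engine. The following is a minimal sketch using Python's standard re module, which is an assumption made for illustration only; the engine described here may differ in details. The helper re.escape is a Python convenience and is not part of the pattern syntax itself.

```python
import re

text = "a (simple) file name.txt"

# Unescaped parentheses form a subpattern, so only the word itself is matched.
unescaped = re.search(r"(simple)", text)
print(unescaped.group(0))   # simple

# Escaped parentheses are compared literally with the text.
escaped = re.search(r"\(simple\)", text)
print(escaped.group(0))     # (simple)

# An unescaped period matches any character; an escaped one matches only ".".
any_char = re.findall(r".m", "a simple text file name.txt")
literal_dot = re.findall(r"\.txt", "a simple text file name.txt")
print(any_char)      # ['im', 'am']
print(literal_dot)   # ['.txt']

# re.escape() inserts the backslashes when the literal text comes from elsewhere.
auto_escaped = re.search(re.escape("(simple)"), text)
print(auto_escaped.group(0))  # (simple)
```

Escaping by hand is fine for short patterns; a helper such as re.escape is mainly useful when the literal text is not known in advance.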
To match a pattern only at the beginning of the subject string, start the pattern with a circumflex character (^):
| This regular expression | matches the text shown in red |
| a | a simple text file name.txt |
| ^a | a simple text file name.txt |
To match a pattern only at the end of the subject string, end the pattern with a dollar sign ($):
| This regular expression | matches the text shown in red |
| xt | a simple text file name.txt |
| xt$ | a simple text file name.txt |
| .x.$ | AppleCuda.kext; To Do List.txt; URLMountUIProxy |
To match only when the pattern spans the entire subject string, use both (e.g. “^hello$”).
| This regular expression | matches the text shown in red |
| hello | hello hello |
| ^hello | hello hello |
| hello$ | hello hello |
| ^hello$ | hello hello |
| ^hello$ | hello |
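The anchor rows above can be checked with a short script. This is a sketch that assumes Python's re module as a stand-in engine; the variable names are illustrative only.

```python
import re

subject = "hello hello"

everywhere = re.findall(r"hello", subject)
at_start = re.findall(r"^hello", subject)
at_end = re.findall(r"hello$", subject)
whole_only = re.search(r"^hello$", subject)

print(everywhere)   # ['hello', 'hello']  (both occurrences)
print(at_start)     # ['hello']           (only the first occurrence)
print(at_end)       # ['hello']           (only the last occurrence)
print(whole_only)   # None                (the subject is not exactly "hello")
print(re.search(r"^hello$", "hello") is not None)  # True

# ".x.$" matches the last three characters when the middle one is an "x".
file_names = ["AppleCuda.kext", "To Do List.txt", "URLMountUIProxy"]
tails = [re.search(r".x.$", name).group(0) for name in file_names]
print(tails)  # ['ext', 'txt', 'oxy']
```

With both anchors in place, a pattern succeeds only if it accounts for every character of the subject string, which is why “^hello$” fails on “hello hello”.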
An unescaped period (.) matches any character.
Classes
To match any one of a given set of characters, such as whitespace, alphabetic characters only, or numeric characters only, use a character class. A character class is a set of characters enclosed in brackets ([]). Any single character listed in the brackets will match. Ranges are allowed and are interpreted in ASCII order. If a circumflex (^) is the first character, the class is negated, so that any character not in the brackets will match. There are also several special classes that match a predefined set of characters, and shortcuts for those classes. A partial list of common special classes and shortcuts is:
| [:white:], \s, [ \t\n\r] | The set of standard whitespace characters: space, tab, carriage return, and newline. |
| [:alpha:], [A-Za-z] | Any alphabetic character. |
| [:alnum:], [A-Za-z0-9] | Any alphanumeric character. |
| [:digit:], \d, [0-9] | Any numeric character. |
| This regular expression | matches the text shown in red |
| [A-Za-z0-9] | March 16, 1955 |
| [A-Z0-9] | March 16, 1955 |
| [aeiou] | new catalog.doc |
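The class examples above can be reproduced as follows. This sketch assumes Python's re module, which does not recognize the [:alpha:] style class names, so the equivalent explicit ranges are used instead; the negated class on the last lines is an extra illustration not taken from the table.

```python
import re

date = "March 16, 1955"

alnum = re.findall(r"[A-Za-z0-9]", date)
upper_or_digit = re.findall(r"[A-Z0-9]", date)
vowels = re.findall(r"[aeiou]", "new catalog.doc")

print("".join(alnum))     # March161955
print(upper_or_digit)     # ['M', '1', '6', '1', '9', '5', '5']
print(vowels)             # ['e', 'a', 'a', 'o', 'o']

# A leading circumflex negates the class: everything that is NOT a letter or digit.
not_alnum = re.findall(r"[^A-Za-z0-9]", date)
print(not_alnum)          # [' ', ',', ' ']

# The \d shortcut and the explicit range agree on this ASCII text.
print(re.findall(r"\d", date) == re.findall(r"[0-9]", date))  # True
```

Each class matches exactly one character per occurrence, so every element in the printed lists is a single character.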
Repetition
To match a given character or class more than once, follow it with a repetition modifier. The modifiers are:
| {n,m} | Match the character between n and m times, inclusive. |
| {n} | Match the character exactly n times. |
| {n,} | Match the character n or more times. |
| {,n} | Match the character between zero and n times, inclusive. Exactly the same as {0,n} |
| * | Match the character zero or more times. Exactly the same as {0,} |
| ? | Match the character either zero or one times. Exactly the same as {0,1} or {,1} |
| + | Match the character one or more times. Exactly the same as {1,} |
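The modifiers can be compared side by side with a small helper. This is a sketch that assumes Python's re module, which accepts all of the forms listed above, including {,n}; the function name matches_whole is hypothetical and simply tests whether the pattern covers the whole string.

```python
import re

def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(matches_whole(r"a{2,3}", "aa"))     # True:  two is within 2..3
print(matches_whole(r"a{2,3}", "aaaa"))   # False: four is more than 3
print(matches_whole(r"a{2}", "aa"))       # True:  exactly two
print(matches_whole(r"a{2}", "aaa"))      # False: three is not exactly two
print(matches_whole(r"a{2,}", "aaaaa"))   # True:  two or more
print(matches_whole(r"a{,2}", ""))        # True:  zero is within 0..2
print(matches_whole(r"a*", ""))           # True:  zero or more
print(matches_whole(r"a?", "aa"))         # False: at most one
print(matches_whole(r"a+", ""))           # False: at least one is required

# A modifier applies to a class just as it does to a single character.
numbers = re.findall(r"[0-9]+", "March 16, 1955")
print(numbers)  # ['16', '1955']
```

Testing against the whole string makes the upper and lower bounds visible; in a plain search, a pattern such as a{2} would also succeed inside a longer run of the character.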
Subpatterns
To apply a modifier to more than one character, or to capture a substring for later use in replacement, use a subpattern. Any part of a regular expression enclosed in parentheses is considered a subpattern. Subpatterns are numbered from left to right, starting from one, in the order of their opening parentheses. By default, all subpatterns are capturing, which means they count in the list of subpatterns and can be used for conditional evaluation and replacement. By using (?:...) instead of (...), a subpattern can be made non-capturing. Such a subpattern is used for grouping purposes only and is not numbered.
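The difference between capturing and non-capturing subpatterns is easiest to see by listing what each one records. The sketch below assumes Python's re module; the pattern and sample text are made up for illustration.

```python
import re

# Both patterns apply "+" to the two-character group "ab".
capturing = re.search(r"(ab)+(c)", "xxababc")
non_capturing = re.search(r"(?:ab)+(c)", "xxababc")

# The overall match is the same in both cases.
print(capturing.group(0))       # ababc
print(non_capturing.group(0))   # ababc

# Only capturing subpatterns are numbered and recorded.
print(capturing.groups())       # ('ab', 'c')
print(non_capturing.groups())   # ('c',)
```

Making a group non-capturing shifts the numbering: the (c) subpattern is number 2 in the first pattern and number 1 in the second.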
Replacing with backreferences
A reference to a captured subpattern that is used later is called a backreference. In a replacement string, “$n\” refers to the text matched by the nth subpattern in the pattern string. Let’s look at an example of a replacement using backreferences in which we replace:
See (.+) run with (.+)
with:
Watch $2\ walk with $1\
| Original | After replacement |
| See Jane run with Mary | Watch Mary walk with Jane |
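The same replacement can be sketched in Python, assuming its re module as a stand-in engine. Note that Python writes a backreference in a replacement string as \n (or \g<n>) rather than the “$n\” form described above; that difference is specific to Python and is not a statement about this document's syntax.

```python
import re

pattern = r"See (.+) run with (.+)"
# \2 and \1 are Python's spelling of the second and first subpattern.
replacement = r"Watch \2 walk with \1"

result = re.sub(pattern, replacement, "See Jane run with Mary")
print(result)  # Watch Mary walk with Jane
```

Because the replacement string begins with the literal word "Watch", the result begins with it too; only the two captured names are carried over from the original text, in swapped order.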
Examples
Consider the following expression:
^(H([[:alpha:]]{4}), (wor(?:l.)))$
This regular expression contains three numbered subpatterns. They are:
- H([[:alpha:]]{4}), (wor(?:l.))
- [[:alpha:]]{4}
- wor(?:l.)
It will match all of the following strings (among many others):
- Hello, world
- Hiiii, world
- Hello, worlt
- Haaaa, worlg
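The numbering of the three subpatterns can be confirmed with a short script. This sketch assumes Python's re module, which does not support [[:alpha:]], so [A-Za-z] is substituted for it; the non-matching string at the end is an added illustration.

```python
import re

# [A-Za-z] stands in for [[:alpha:]], which Python's re does not recognize.
pattern = re.compile(r"^(H([A-Za-z]{4}), (wor(?:l.)))$")

# The non-capturing (?:l.) is not counted.
print(pattern.groups)  # 3

samples = ["Hello, world", "Hiiii, world", "Hello, worlt", "Haaaa, worlg"]
captured = [pattern.search(s).groups() for s in samples]
for groups in captured:
    print(groups)
# ('Hello, world', 'ello', 'world')
# ('Hiiii, world', 'iiii', 'world')
# ('Hello, worlt', 'ello', 'worlt')
# ('Haaaa, worlg', 'aaaa', 'worlg')

# A string that does not start with "H" is rejected by the anchored pattern.
rejected = pattern.search("Jello, world")
print(rejected)  # None
```

Subpattern 1 always equals the whole string here because it spans everything between the two anchors.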